This report summarizes recent progress by Signal Innovations Group (SIG) in supporting
the Naval Research Laboratory (NRL) on development of a new low-frequency sonar
system. SIG has the tasks of developing the algorithms and transitioning them to NRL
for use in sea tests. The discussion below provides a summary of the following items: (i)
a kernel-based matching pursuits classification algorithm, (ii) life-long learning, (iii) in
situ learning, and (iv) a discussion of the features used within the algorithms. Items (i)
and (iv) are fully transitioned to NRL, and have been employed during sea tests. Items
(ii) and (iii) are currently under development by SIG, in cooperation with NRL.

A. Kernel Matching Pursuits (KMP)

Assume the feature vector is a d-dimensional real vector $x \in \mathbb{R}^d$, which we wish
to map to a label $y \in \{0,1\}$; label $y=1$ may correspond to a mine, and $y=0$ to clutter. We
aim to learn the optimal parameters w of the functional relationship $y = f(x, w)$ between
d independent feature variables x and the dependent output variable y. To accomplish
this learning task we are provided with N labeled observations, $\{x_i, y_i\}_{i=1}^{N}$, that are
assumed to be independently and identically drawn from an unobservable underlying
distribution. We are interested in learning sparse kernel machines of functional form

$$f_n(x) = \sum_{i=1}^{n} w_{n,i} K(c_i, x) + w_{n,0} = w_n^T \phi_n(x) \qquad (1)$$

where $w_{n,0}$ is the bias term, $K(\cdot,\cdot)$ is a kernel function measuring the similarity between
two data samples,

$$\phi_n(\cdot) = [1, K(c_1,\cdot), K(c_2,\cdot), \ldots, K(c_n,\cdot)]^T \qquad (2)$$

with $\{K(c_i,\cdot)\}_{i=1}^{n}$ the kernel-induced basis functions centered at $c_i$, and
$w_n = [w_{n,0}, w_{n,1}, \ldots, w_{n,n}]^T$
are the weights that combine the basis functions in the summation, and the subscript n is
used to denote the number of basis functions being used, with n<N. In the context of the
binary classification problem considered in this section, a given x is mapped to an estimated
$\hat{y} \in \{0,1\}$ as $\hat{y} = U[f(x) - 0.5]$, where $U(a)$ is a unit step function, equal to one for
$a > 0$, and equal to zero otherwise.
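The functional form and decision rule above can be sketched in a few lines. This is a minimal illustration, not the code transitioned to NRL: the Gaussian RBF kernel, its width, and the function names `rbf_kernel`, `phi`, `kmp_output`, and `kmp_label` are assumptions made for the example.

```python
import numpy as np

def rbf_kernel(c, x, width=1.0):
    """Gaussian RBF similarity K(c, x); the kernel choice and width are assumptions."""
    diff = np.asarray(c, dtype=float) - np.asarray(x, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * width ** 2)))

def phi(centers, x, width=1.0):
    """Basis vector [1, K(c_1, x), ..., K(c_n, x)]^T."""
    return np.array([1.0] + [rbf_kernel(c, x, width) for c in centers])

def kmp_output(weights, centers, x, width=1.0):
    """f_n(x) = w_n^T phi_n(x); weights[0] is the bias term."""
    return float(np.dot(weights, phi(centers, x, width)))

def kmp_label(weights, centers, x, width=1.0):
    """Unit-step decision: 1 if f_n(x) - 0.5 > 0, else 0."""
    return 1 if kmp_output(weights, centers, x, width) - 0.5 > 0 else 0

# Hypothetical two-center machine: zero bias, all weight on the first center.
centers = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
weights = np.array([0.0, 1.0, 0.0])
label_near = kmp_label(weights, centers, np.array([0.1, 0.0]))
label_far = kmp_label(weights, centers, np.array([2.0, 2.0]))
```

A sample close to the weighted center produces an output near one and is labeled 1, while a distant sample produces an output near zero and is labeled 0.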

The KMP implements a set of functions of the form in (1). Assume we are given a
training set $\{x_i, y_i\}_{i=1}^{N}$, where $x_i$ is the input and $y_i$ its expected output. The weighted
sum of squared errors between the expected output and the KMP output given in (1) is

$$e_n = \sum_{i=1}^{N} \beta_i \left[ y_i - f_n(x_i) \right]^2 \qquad (4)$$

where $\beta_i$ is a constant responsible for quantifying the importance of the ith training
sample $(x_i, y_i)$. For example, $1/\beta_i$ may represent the variance of the ith measurement;
noisy measurements will therefore be given less importance when learning the model. In
addition, if one has a priori knowledge that some data $x_i$ are in some sense “better”
representative of the system being modeled, this can be accounted for in the parameter $\beta_i$.
The unknowns in (4) are the centers $c_i$ of the basis functions in $\phi_n$, and the weights are
represented by $w_n$. At the moment we suppose $c_i$, and consequently $\phi_n$, are known and
aim at solving for $w_n$; below we address determining $c_i$. Then the value of $w_n$ that
minimizes (4) is found to be

$$w_n = M_n^{-1} \sum_{i=1}^{N} \beta_i y_i \phi_{n,i} \qquad (5)$$

where $\phi_{n,i}$ is an abbreviation of $\phi_n(x_i)$, and

$$M_n = \sum_{i=1}^{N} \beta_i \phi_{n,i} \phi_{n,i}^T \qquad (6)$$

is the Fisher information matrix.
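For fixed centers, the weight solution is an ordinary weighted least-squares problem. The sketch below is illustrative only: it assumes the importance-weighted squared error described above, and the function name `fit_kmp_weights` and the random test data are hypothetical.

```python
import numpy as np

def fit_kmp_weights(Phi, y, beta):
    """Weighted least-squares weights for fixed centers.

    Phi  : (N, n+1) matrix whose ith row is phi_n(x_i)^T (first column all ones)
    y    : (N,) expected outputs
    beta : (N,) per-sample importance weights
    """
    Phi = np.asarray(Phi, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.asarray(beta, dtype=float)
    M = Phi.T @ (beta[:, None] * Phi)   # sum_i beta_i phi_i phi_i^T
    rhs = Phi.T @ (beta * y)            # sum_i beta_i y_i phi_i
    w = np.linalg.solve(M, rhs)         # assumes M is positive definite
    return w, M

# Hypothetical data: a bias column plus two basis-function columns.
rng = np.random.default_rng(0)
Phi = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
y = rng.integers(0, 2, size=20).astype(float)
beta = rng.uniform(0.5, 2.0, size=20)
w, M = fit_kmp_weights(Phi, y, beta)

# At the minimizer the gradient of the weighted squared error vanishes.
residual_gradient = Phi.T @ (beta * (y - Phi @ w))
```

Solving the linear system directly, rather than forming the inverse of the matrix, is the numerically preferable choice here.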

We now address learning the optimal $c_i$. An nth order KMP employs n basis
functions. According to the definition in (1), the (n+1)th order KMP is inductively
written as

$$f_{n+1}(x) = w_{n+1}^T \phi_{n+1}(x) \qquad (7)$$

where

$$\phi_{n+1}(\cdot) = [1, K(c_1,\cdot), K(c_2,\cdot), \ldots, K(c_n,\cdot), K(c_{n+1},\cdot)]^T = [\phi_n(\cdot)^T, K(c_{n+1},\cdot)]^T \qquad (8)$$

with  ,x,.) .  It  may  be  stressed  that  formulae  (13)-(14)  provide  a  crucial 

method  for  reducing  the  computational  complexity.  These  techniques  enable  very  fast 
design  of  kernel  machines  to  be  performed,  even  on  large  datasets. 

With sufficient training data points, we can always make $M_{n+1}$ positive definite.
Then $M_{n+1}^{-1}$ is also positive definite and it holds $b^{-1} > 0$, which guarantees $\delta e(K, c_{n+1})$ is
always greater than zero. Therefore, from (12B), $e_{n+1} < e_n$, which means appending a new
basis function to the KMP generally leads to a decrease of the representation error on the
training sample; the effect on generalization is more complex and has been described in
the previous section.

Since $\delta e(K, c_{n+1})$ is dependent on the center of the new basis function, we
obtain different values of $\delta e(K, c_{n+1})$ by selecting different $c_{n+1}$. If we confine $c_{n+1}$ to be
selected from the training data, we may conduct a “greedy” search in the training set,
with the previously selected data excluded to avoid repetition, selecting the datum that
maximizes (13). Formally, we have

$$c_{n+1} = \arg\max_{x_k \notin \{c_1, \ldots, c_n\},\; 1 \le k \le N} \delta e(K, x_k) \qquad (15)$$

After $c_{n+1}$ is determined, we update the weights using (12A) and the Fisher information
matrix.
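The greedy search can be illustrated with a brute-force sketch. It does not use the fast update formulae (12)-(14); instead it refits the weights for every candidate center and keeps the one with the smallest weighted error, which selects the same datum as maximizing the error reduction because the current error is fixed during each search. The RBF kernel, its width, and all function names are assumptions.

```python
import numpy as np

def design(X, centers, width):
    """Rows are [1, K(c_1, x), ..., K(c_n, x)] for an assumed Gaussian RBF kernel."""
    cols = [np.ones(len(X))]
    for c in centers:
        cols.append(np.exp(-np.sum((X - c) ** 2, axis=1) / (2.0 * width ** 2)))
    return np.column_stack(cols)

def weighted_error(X, y, beta, centers, width):
    """Minimum weighted squared error for a fixed set of centers."""
    Phi = design(X, centers, width)
    s = np.sqrt(beta)
    w = np.linalg.lstsq(Phi * s[:, None], y * s, rcond=None)[0]
    r = y - Phi @ w
    return float(np.sum(beta * r ** 2)), w

def greedy_kmp(X, y, beta, n_basis, width=1.0):
    """Add one training datum at a time as a new center, excluding earlier picks."""
    chosen, errors = [], []
    for _ in range(n_basis):
        best = None
        for k in range(len(X)):
            if k in chosen:
                continue  # previously selected data are excluded
            trial = [X[j] for j in chosen + [k]]
            err, _ = weighted_error(X, y, beta, trial, width)
            if best is None or err < best[0]:
                best = (err, k)
        errors.append(best[0])
        chosen.append(best[1])
    return chosen, errors

# Hypothetical two-cluster data set with equal importance weights.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(15, 2)),
               rng.normal(2.0, 0.5, size=(15, 2))])
y = np.concatenate([np.zeros(15), np.ones(15)])
beta = np.ones(30)
chosen, errors = greedy_kmp(X, y, beta, n_basis=3)
```

The recorded training errors never increase as centers are added, which matches the statement that appending a basis function does not raise the representation error. The refit-per-candidate loop is far slower than the recursive updates and is meant only to make the selection rule concrete.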

From (13), $\delta e(K, c_{n+1})$ depends on the functional form of the kernel as
well as on the support samples. This allows us to optimize the kernel to gain further
error reduction. A simple approach is to first conduct a “greedy” search of $c_{n+1}$ in
the training set, for a fixed kernel, and then fix $c_{n+1}$ and optimize the parameters of the
kernel. For radial basis function (RBF) kernels, the only parameter other than $c_{n+1}$ is the
kernel width, thus optimization of RBF kernels with fixed $c_{n+1}$ is a one-dimensional
search for the kernel width. It is also possible to optimize $c_{n+1}$ and the kernel width
simultaneously, but then $c_{n+1}$ is treated as a free parameter and is no longer confined to
the training set. Another possibility is optimization over kernels of different functional
forms, which offers greater diversity of the basis functions available to the KMP.

B.  Life  Long  Learning 

Assume that an MCM sensor system has previously performed M-1 sensing tasks,
with each task characterized by a particular environment, mines and clutter. Now
consider a new sensing task, for a total of M tasks. Assume that the mth task is
characterized by $N_m$ labeled signatures, i.e., the labeled data for task m are
$D_m = \{(x_{nm}, y_{nm}) : n = 1, \ldots, N_m\}$, where $x_{nm}$ is a d-dimensional real feature vector and
$y_{nm} \in \{-1, 1\}$ is the associated label. Our objective is to design a classifier for the new
task M while leveraging the related information available from the previous M-1 tasks.
The algorithm discussed below automatically determines which of the M-1 previous tasks
are relevant for learning an algorithm for task M, while minimizing the importance of the
tasks that are not relevant.

The learning algorithm discussed here simultaneously designs a classifier for each
of the M tasks, and in each case information (data) from the other tasks is shared if
deemed relevant. Consequently there are two important applications of the algorithm:

•  Life-long learning, in which the algorithm trained for a new task M is placed
within the context of all previous M-1 tasks (i.e., placed within context of
historical  data).  In  this  manner  the  algorithm  designed  for  a  new  task  exploits  all 
relevant  information  from  previous  tasks.  This  has  the  important  property  of 
reducing  the  quantity  of  labeled  data  required  for  each  of  the  individual  tasks, 
since  data  are  shared  among  all  tasks. 

•  For multiple MCM platforms, a networked suite of distributed sensors observes
different  portions  of  the  environment.  Assume  M  sensor  platforms  collect  data. 
The  processing  of  these  data  may  be  viewed  as  M  tasks,  and  it  is  desirable  to 
integrate  the  execution  of  these  multiple  tasks,  yielding  multi-task  learning.  In 
multi-task  learning,  when  analyzing  a  particular  task,  data  from  the  other  M-1 
tasks  are  appropriately  exploited.  Consequently,  in  the  context  of  a  multi-platform 
UUV  solution,  the  data  from  each  platform  is  viewed  as  a  task,  and  the  multi-task 
learning  algorithm  optimally  shares  information  across  tasks. 

A  nonparametric  Bayesian  model  is  considered  for  jointly  learning  multiple  classifiers, 
each  corresponding  to  a  task,  with  an  associated  dataset.  In  particular,  we  employ  the 
Dirichlet  Process  Mixture  (DPM)  as  the  common  prior  on  the  model  parameters  of  the 
tasks.  The  model  automatically  identifies  task  clusters  via  Bayesian  inference.  The  main 
advantage  of  a  nonparametric  model  is  that  it  makes  no  assumptions  regarding  the 
underlying  distributions,  and  therefore  it  provides  a  richer  and  more  flexible 
representation  than  its  parametric  counterparts. 

Recall that the labeled data for task m are represented as
$D_m = \{(x_{nm}, y_{nm}) : n = 1, \ldots, N_m\}$. Our objective is to learn classifiers for each of the
M tasks, by simultaneously sharing information (data) deemed relevant by the
multi-task-learning algorithm. For each task m the conditional probability of label $y_{nm}$ given $x_{nm}$
is modeled via logistic regression

$$p(y_{nm} \mid w_m, x_{nm}) = \sigma(y_{nm} w_m^T x_{nm}) \qquad (16)$$

where $w_m$ parameterizes the classifier for task m, and $\sigma(x) = \exp(x)/[1 + \exp(x)]$. The
goal is to learn $\{w_m\}_{m=1}^{M}$ jointly such that the resulting classifiers can accurately
predict class labels for new test samples. The hierarchical model of $\{w_m\}_{m=1}^{M}$ is
specified as


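$$w_m \mid G \sim G, \qquad G \sim DP(\alpha, G_0) \qquad (17)$$

where $DP(\alpha, G_0)$ is a Dirichlet process with precision parameter $\alpha$ and base distribution
$G_0$. The Dirichlet process is used to account for the uncertainty of G. Using
Sethuraman’s stick-breaking representation, we can write

$$G = \sum_{k=1}^{\infty} \pi_k(v)\, \delta_{w_k^*}, \qquad \pi_k(v) = v_k \prod_{i=1}^{k-1} (1 - v_i), \qquad v_k \sim \mathrm{Beta}(1, \alpha) \qquad (18)$$

where $\delta_{w_k^*}$ is a point mass at the atom $w_k^*$, drawn independently from $G_0$.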
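To make the stick-breaking construction concrete, the sketch below draws the mixture weights for a finite truncation. The truncation level, the function name `stick_breaking_weights`, and the parameter values are assumptions for illustration; the report does not state how the infinite sum is handled in practice.

```python
import numpy as np

def stick_breaking_weights(alpha, truncation, rng):
    """Draw pi_k = v_k * prod_{i<k} (1 - v_i) with v_k ~ Beta(1, alpha)."""
    v = rng.beta(1.0, alpha, size=truncation)
    # Fraction of the unit stick still unbroken before each break.
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

rng = np.random.default_rng(0)
pi = stick_breaking_weights(alpha=1.0, truncation=50, rng=rng)
```

The truncated weights sum to slightly less than one; the shortfall is the length of stick left unbroken. A smaller precision parameter concentrates the mass on fewer components, which in the multi-task setting corresponds to fewer task clusters.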


Using  (17)  and  (18)  we  have 


/c=l 


(19) 


where the cluster membership indicator for task m equals 1 if $w_m$ belongs to cluster
k, and 0 otherwise. The clustering structure of $\{w_m\}_{m=1}^{M}$ represents relatedness among
tasks.

The  use  of  Dirichlet  processes  in  Bayesian  inference  represents  the  state  of  the  art 
in  Bayesian  analysis,  providing  flexibility  and  generality  not  available  in  traditional 
approaches.  However,  an  important  challenge  that  must  be  addressed  is  computation  of 


the integrals required for inference. To address this challenge we utilize variational Bayes
(VB) inference, which we discuss next.

Assume the model parameters of interest are represented by the vector $\theta$; we
hope to learn these parameters based on observed data D. For density function estimation
$\theta$ may represent the parameters of a Gaussian mixture model (GMM), while in a
classification problem $\theta$ may represent the weights w in the incomplete-data
logistic-regression classifier.

Our objective is to obtain the posterior probability distribution of the hidden
variables $\theta$ based on a set of observed variables D (for GMM design the data D is
unlabeled, while for the logistic-regression classifier D is labeled or imperfectly labeled
data). Since an exact inference of the hidden variables $\theta$ based on the observed variables
D is intractable for all but the simplest model structures, our goal is to find a tractable
variational distribution $Q(\theta)$ that closely approximates the true posterior distribution
$p(\theta \mid D)$.

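Let $p(D)$ denote the marginal probability of the observed data D. The
log-marginal can be written as

$$\ln p(D) = L(Q) + KL(Q \| P) \qquad (20)$$

where

$$L(Q) = \sum_{\theta} Q(\theta) \ln \frac{p(D \mid \theta)\, p(\theta)}{Q(\theta)} \qquad (21)$$

and

$$KL(Q \| P) = \sum_{\theta} Q(\theta) \ln \frac{Q(\theta)}{p(\theta \mid D)} \qquad (22)$$

with $P = p(\theta \mid D)$. The summations in (21) and (22) are replaced by integrals if the hidden
variables $\theta$ are continuous.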
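The decomposition of the log-marginal into a lower bound plus a KL divergence can be checked numerically on a small discrete example. The three-state hidden variable and all probability values below are made up for illustration.

```python
import numpy as np

# Hypothetical discrete model with three values of the hidden variable.
prior = np.array([0.5, 0.3, 0.2])           # p(theta)
likelihood = np.array([0.10, 0.40, 0.70])   # p(D | theta) for the observed D
joint = likelihood * prior                  # p(D, theta)
evidence = joint.sum()                      # p(D)
posterior = joint / evidence                # p(theta | D)

# An arbitrary variational distribution over the same three states.
Q = np.array([0.2, 0.5, 0.3])

lower_bound = np.sum(Q * np.log(joint / Q))   # L(Q)
kl = np.sum(Q * np.log(Q / posterior))        # KL(Q || P)
log_evidence = np.log(evidence)
```

For any valid choice of the variational distribution, the lower bound and the KL term add up to the log-marginal, and the KL term is non-negative, so the bound never exceeds the log-marginal. Setting the variational distribution equal to the posterior drives the KL term to zero.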

Note that the above expression is true for any approximating variational
distribution $Q(\theta)$. The term $KL(Q \| P)$ represents the Kullback-Leibler (KL) divergence
between the true posterior $p(\theta \mid D)$ and its variational approximation $Q(\theta)$. Our objective
is to optimize $Q(\theta)$ to minimize the KL divergence between $Q(\theta)$ and $p(\theta \mid D)$.
However, since the posterior density function $p(\theta \mid D)$ is unknown, and is the subject of this
analysis, the KL divergence in (22) cannot be evaluated. Nevertheless, since $KL(Q \| P)$ is
always non-negative, the term $L(Q)$ forms a lower bound of the log-marginal, $\ln p(D)$.
Consequently, minimization of $KL(Q \| P)$ with respect to Q is equivalent to maximization
of $L(Q)$, since the left hand side $\ln p(D)$ is independent of the variational distribution Q.
All of the terms in (21) can be evaluated, and therefore the variational Bayes (VB)
approximation to $p(\theta \mid D)$ reduces to determining the $Q(\theta)$ that maximizes
the variational expression $L(Q)$.

For the sake of tractability, we assume that the hidden variables are independent
of each other, meaning $Q(\theta)$ may be written in a factorized form as $Q(\theta) = \prod_i Q_i(\theta_i)$,
where $\{\theta_i\}$ is the set of disjoint hidden variables indexed by i constituting $\theta$. In
variational inference we optimize the factors of the variational distribution one at a time,
cycling sequentially through all factors. We accomplish this by separating out the terms
involving a factor $Q_i(\theta_i)$ (approximating the distribution for hidden variable $\theta_i$). We can
therefore maximize the lower bound $L(Q)$ with respect to a single factor $Q_i$ (assuming all
$Q_j$, $j \neq i$, are temporarily fixed), and then cycle through each hidden variable $\theta_i$ in turn,
replacing the current distribution $Q_i(\theta_i)$ with a revised estimate $Q_i^*(\theta_i)$.
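For reference, the revised estimate of a single factor has a well-known closed form under the factorized assumption. The expression below is the standard mean-field result rather than a formula stated in this report; $\mathbb{E}_{j \neq i}$ denotes an expectation taken over all factors other than the ith.

```latex
\ln Q_i^{*}(\theta_i) = \mathbb{E}_{j \neq i}\left[ \ln p(D, \theta) \right] + \text{const},
\qquad
Q_i^{*}(\theta_i) =
  \frac{\exp\left( \mathbb{E}_{j \neq i}\left[ \ln p(D, \theta) \right] \right)}
       {\int \exp\left( \mathbb{E}_{j \neq i}\left[ \ln p(D, \theta) \right] \right) d\theta_i}
```

Each update of this form cannot decrease the lower bound, so cycling through the factors yields a monotonically non-decreasing sequence of bound values.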

This iterative VB analysis can be performed efficiently if each $Q_i(\theta_i)$ is conjugate
to the likelihood function with all $\theta_j$, $j \neq i$, equal to a constant. Specifically, this conjugacy
property allows the update equations to be performed analytically, thereby yielding a VB
algorithm with computational speed commensurate with the widely used EM algorithm
employed in ML point estimates of the parameters $\theta$. Fortunately, many models have a
structure that is directly amenable to appropriate conjugate priors.

C.  In  Situ  Learning 

Assume that the labeled data with which classifier design may be performed is
denoted $D_L$, and the unlabeled data at the new site of interest is represented as $D_U$. We
consider a nonlinear classifier based on a set of $N_B$ basis functions $\{K(x, b_n)\}_{n=1}^{N_B}$, where the
basis functions are determined using the techniques discussed above in the context of
KMP. We again utilize the kernel-based function

$$f(x; w) = \sum_{n=1}^{N_B} w_n K(x, b_n) + w_0 = w^T K(x) \qquad (23)$$

and the probability that x is associated with label y=1 is expressed as

$$p(y = 1 \mid x, w) = \sigma[f(x; w)] = 1/\{1 + \exp[-f(x; w)]\} \qquad (24)$$

with $p(y = -1 \mid x, w) = 1 - p(y = 1 \mid x, w)$. In (23) w is an $N_B+1$ dimensional vector composed
of the weights $\{w_n\}_{n=0}^{N_B}$, with the $N_B+1$ dimensional vector $K(x)$ defined in terms
of the components $K(x, b_n)$. The function $K(x, x_n)$ is a general kernel defining the similarity
of feature vectors x and $x_n$. The radial basis function,

$$K(x, x_n) = \exp\left( -\frac{\|x - x_n\|^2}{2\sigma^2} \right) \Big/ \sqrt{2\pi\sigma^2},$$

with $\sigma$ here denoting the kernel width, represents one class of kernels that may be
employed. It is important to emphasize that the classifier in (23) and (24) yields a
probabilistic measure as to the confidence that a given feature vector x is associated with
a given label y, thereby presenting the decision maker with a level of algorithmic
confidence.

Assume the kernel-based classifier in (23) and (24) is trained using the $N_L$ labeled
signatures in $D_L$. As a consequence of this training we obtain a posterior estimate of the
weights given the training data, $p(w \mid D_L)$. We may compute the information accrued in w
via the data $D_L$ via the $(N_B+1) \times (N_B+1)$ dimensional Fisher information matrix, with ij
element

$$F_{ij} = -E\left[ \frac{\partial^2}{\partial w_i \partial w_j} \log p(w \mid D_L) \right] \qquad (25)$$

As is well known, the Cramer-Rao bound is defined by the inverse of the Fisher
information matrix, thereby defining the minimum variance with which one may estimate the
weights w given the finite labeled data $D_L$. To a good approximation the Fisher
information matrix based on $D_L$ may be expressed as

$$F(D_L) = \sum_{n=1}^{N_L} K(x_n) K(x_n)^T \sigma[f(x_n; w)]\, \sigma[-f(x_n; w)] \qquad (26)$$
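The approximate Fisher information matrix can be accumulated sample by sample as a sum of weighted outer products. In this sketch the unnormalized RBF kernel, the helper names `sigmoid`, `kernel_vector`, and `fisher_information`, and the numerical values are assumptions made for illustration.

```python
import numpy as np

def sigmoid(a):
    """Logistic link, equivalent to exp(a) / (1 + exp(a))."""
    return 1.0 / (1.0 + np.exp(-a))

def kernel_vector(x, basis, width=1.0):
    """K(x) = [1, K(x, b_1), ..., K(x, b_NB)] with an assumed unnormalized RBF."""
    d2 = np.sum((basis - x) ** 2, axis=1)
    return np.concatenate(([1.0], np.exp(-d2 / (2.0 * width ** 2))))

def fisher_information(X, basis, w, width=1.0):
    """Sum over samples of K(x) K(x)^T sigma[f(x)] sigma[-f(x)]."""
    F = np.zeros((len(w), len(w)))
    for x in X:
        k = kernel_vector(x, basis, width)
        f = float(w @ k)
        F += np.outer(k, k) * sigmoid(f) * sigmoid(-f)
    return F

# Hypothetical basis vectors, weights, and labeled feature vectors.
basis = np.array([[0.0, 0.0], [1.0, 1.0]])
w = np.array([0.1, 1.0, -1.0])
X_labeled = np.array([[0.0, 0.1], [0.9, 1.0], [0.5, 0.5]])
F = fisher_information(X_labeled, basis, w)
```

Note that the labels themselves do not enter the computation; each sample contributes according to how uncertain the classifier is about it, with the largest factor of 0.25 when the classifier output is zero.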

We may now quantify the maximum information that may be added if we acquire a label
for that member of the unlabeled data $D_U$ for which label acquisition would be most
informative

$$\delta = \max_{x \in D_U} \operatorname{trace}\left\{ F(D_L) + K(x) K(x)^T \sigma[f(x; w)]\, \sigma[-f(x; w)] \right\} \qquad (27)$$

In (27) we have employed the trace, but any matrix measure may be used, such as
the determinant. In any case, (27) quantifies the information content that may be accrued
with regard to estimating the classifier weights w, based upon acquiring the label for the
single most informative member of the new (testing) data $D_U$.
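The selection rule can be sketched as a loop over the unlabeled candidates that scores each one by the trace of the updated information matrix. The unnormalized RBF kernel, the helper names, and the toy values (including the identity matrix standing in for the labeled-data Fisher information) are assumptions for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def kernel_vector(x, basis, width=1.0):
    """K(x) = [1, K(x, b_1), ..., K(x, b_NB)] with an assumed unnormalized RBF."""
    d2 = np.sum((basis - x) ** 2, axis=1)
    return np.concatenate(([1.0], np.exp(-d2 / (2.0 * width ** 2))))

def most_informative(X_unlabeled, F_labeled, basis, w, width=1.0):
    """Return the index and score of the candidate maximizing the trace criterion."""
    scores = []
    for x in X_unlabeled:
        k = kernel_vector(x, basis, width)
        f = float(w @ k)
        gain = np.outer(k, k) * sigmoid(f) * sigmoid(-f)
        scores.append(float(np.trace(F_labeled + gain)))
    idx = int(np.argmax(scores))
    return idx, scores[idx]

# Hypothetical setup: one basis vector at the origin and two candidates.
basis = np.array([[0.0, 0.0]])
w = np.array([0.0, 1.0])
F_labeled = np.eye(2)   # placeholder for the Fisher matrix of the labeled data
candidates = np.array([[5.0, 5.0], [0.0, 0.0]])
idx, score = most_informative(candidates, F_labeled, basis, w)
```

Because the trace is linear, the labeled-data term is the same for every candidate, so under this criterion the ranking depends only on the squared norm of the kernel vector times the product of the two sigmoid terms. With the determinant in place of the trace, the labeled-data matrix would influence the ranking.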

Using the measure $\delta$ above, one may quantify which element $x \in D_U$ would
be most informative to classifier design if it could be employed within the training phase.
To use this $x \in D_U$ while training, the associated label y is required. Hence $\delta$ provides
feedback as to which element of $D_U$ is most desirable for label acquisition. This label
may then be acquired via personnel, by near-range possibly unmanned sensors, or via an
analyst.

This is termed in situ learning because the algorithm automatically infers which
unlabeled signatures from the site of interest would be most informative to classifier
design if the associated labels could be acquired. This algorithm may be applied for cases
in which there are few or no pre-existing labeled data sets for training.


D. Feature Extraction

The implementation of identification algorithms is predicated on extracting
features from the target strength data collected from unknown objects that are in an area
to be cleared of mines. Save the necessity of collecting raw data containing exploitable
differences between mines and clutter, feature extraction is paramount. Our overall


design  clearly  delineates  feature  extraction  from  identification  using  the  extracted 
features.  The  unknown  nature  of  the  clutter  in  various  environments  we  plan  on 
measuring  in  the  near  future  precludes  discussion  of  the  final  features  that  will  be 
employed  in  the  operational  system.  However,  the  features  used  today  and  the 
methodology used to select them are discussed in turn. Following these discussions the
issue, which was deferred earlier in the document, of how to combine the advantages
of a discriminative approach with the sequential nature of the data is discussed.

Feature  extraction  converts  the  structural  acoustic  information  contained  in  the 
scattered,  multi-aspect,  acoustic  signals  from  targets  in  the  low  frequency  band  into  a 
form  that  can  be  utilized  by  the  identification  algorithms.  Signal  processing  techniques 
are used to produce multi-aspect target strength (either in the time or frequency domain)
from  the  scattered  signals.  Features  are  then  extracted  from  the  target  strength  to 
implement  mine  identification. 

To date, four different feature sets have been explored when analyzing the
identification performance of the system: normalized energy in sub-bands, wavelet
moments, relaxed matching pursuits, and central moments. In general, the normalized
energy features worked as well as or better than the other feature sets for the most
challenging identification scenarios, and the results using this feature set are presented in
previous reports about the program. Thirty-six equally distributed frequency bands are
used for the performance estimation of the system. The normalized energy features are
the energy in a frequency band normalized by the total energy in the signal:

$$x = \frac{\int_{f_1}^{f_2} |P(f)|^2 \, df}{\int_{0}^{\infty} |P(f)|^2 \, df},$$

where $P(f)$ is the signal’s spectrum, and $f_1$ and $f_2$ define the
frequency band.
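A discrete version of the normalized energy features can be sketched with an FFT. This is an illustration under stated assumptions, not the processing chain used in the sea tests: the integrals are replaced by sums over FFT bins, the bands are formed by splitting the one-sided spectrum into nearly equal groups of bins, and the function name and test signal are hypothetical.

```python
import numpy as np

def normalized_band_energies(signal, n_bands=36):
    """Energy in each of n_bands frequency bands divided by the total energy."""
    power = np.abs(np.fft.rfft(signal)) ** 2   # one-sided power spectrum
    bands = np.array_split(power, n_bands)     # nearly equal-width bands
    return np.array([band.sum() for band in bands]) / power.sum()

# Hypothetical test signal: a pure tone that completes 95 cycles in 720 samples.
n = 720
t = np.arange(n)
signal = np.sin(2.0 * np.pi * 95.0 * t / n)
features = normalized_band_energies(signal)
strongest_band = int(np.argmax(features))
```

Because every feature is divided by the same total, the feature vector sums to one and is insensitive to the overall signal level. For the tone above, essentially all of the energy falls in a single band.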

The normalized energy features have been robust to the channel variations due to
the changing depth-to-range ratio encountered thus far. Intuitively, this is due to the fact
that the spectral beating, controlled by the arrival times that change when the
depth-to-range ratio changes, is averaged out by the integral in the numerator of the
expression for the normalized energy features.

We have considered two methodologies for feature selection. The more-sophisticated
approach is called joint classifier and feature optimization (JCFO), which
represents an extension of the aforementioned kernel-based classifiers. Specifically, we
employ an augmented kernel representation of the following form

$$f(x; w, \theta) = \sum_{n=1}^{N_B} w_n K(x, b_n; \theta) + w_0 = w^T K(x; \theta) \qquad (28)$$

where a new vector $\theta$ is introduced, of dimension d, corresponding to the
dimensionality of the feature vector x. The kernel is represented as

$$K(x, b_n; \theta) = \exp\left[ -\sum_{i=1}^{d} \theta_i (x_i - b_{n,i})^2 \right] \qquad (29)$$

where $x_i$ is the ith component of the vector x, and $b_{n,i}$ is the ith component of the vector
$b_n$. The scalar $\theta_i$ weights the importance of the ith feature in the classifier. In the JCFO
algorithm the goal is to learn the vectors w and $\theta$. A sparseness prior (regularizer) is
placed on both of these parameters, such that in the final classifier most components of w
and $\theta$ are zero or near-zero. We thereby simultaneously learn which basis vectors are
most relevant for classifier design (those with non-zero corresponding components in w),
as well as the most-relevant features (those with non-zero components in $\theta$). This
simultaneous learning of the classifier design and the associated features plays an
important role in JCFO performance, since the optimal features are a function of the
specific classifier employed, and vice versa.
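The role of the feature-weighting vector can be seen by evaluating the augmented kernel directly. The sketch covers only the kernel and classifier output, not the JCFO learning procedure or its sparseness priors; the function names and numerical values are hypothetical.

```python
import numpy as np

def jcfo_kernel(x, b, theta):
    """Feature-weighted kernel exp[-sum_i theta_i (x_i - b_i)^2]."""
    x, b, theta = (np.asarray(a, dtype=float) for a in (x, b, theta))
    return float(np.exp(-np.sum(theta * (x - b) ** 2)))

def jcfo_output(x, basis, w, theta):
    """Classifier output w^T K(x; theta), with w[0] the bias weight."""
    k = np.array([1.0] + [jcfo_kernel(x, b, theta) for b in basis])
    return float(np.dot(w, k))

# With theta = [1, 0] the second feature is switched off entirely.
theta = np.array([1.0, 0.0])
b = np.array([0.0, 0.0])
k_second_feature_only = jcfo_kernel([0.0, 3.0], b, theta)   # differs only in feature 2
k_first_feature = jcfo_kernel([1.0, 0.0], b, theta)         # differs in feature 1
```

A zero component in the feature-weighting vector removes the corresponding feature from every basis function at once, which is how sparsity in that vector performs feature selection.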

While the JCFO algorithm represents an excellent tool for classifier design and
feature selection, the fact that we must solve for two vectors, w and $\theta$, makes the
algorithm relatively slow for large data sets. We therefore also utilize the simpler
approach of designing a classifier using the functional

$$f(x; w) = \sum_{i=1}^{d} w_i x_i + w_0 = w^T x + w_0 \qquad (30)$$

where now the dimensionality of the vector w is equal to d, the number of features. We
place a sparseness prior on w, and thereby select the most relevant features when
performing classifier design. We have found this algorithm to often serve as an excellent
tool for designing a classifier and selecting features, and the computational speed of this
approach is typically quite fast. In many practical applications one may utilize (30) to
determine an initial weighting on the importance of features, with the insight so learned
used to initialize the JCFO algorithm, yielding improved JCFO convergence properties.
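One common way to realize a sparseness prior on a linear classifier is an L1 penalty on the weights, which corresponds to a Laplace prior under MAP estimation. The report does not specify the prior or the optimizer, so the sketch below is only an assumed stand-in: it fits a logistic model with labels in {-1, 1} by proximal gradient descent with soft thresholding, and the names `soft_threshold` and `fit_sparse_linear` and the synthetic data are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the L1 penalty: shrink toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fit_sparse_linear(X, y, lam=0.1, n_iter=500):
    """L1-penalized logistic fit of f(x) = w^T x + w0 for labels y in {-1, +1}."""
    N, d = X.shape
    w, w0 = np.zeros(d), 0.0
    Xa = np.column_stack([np.ones(N), X])
    # Step 1/L, with L = ||Xa||_2^2 / (4N) bounding the curvature of the mean loss.
    step = 4.0 * N / (np.linalg.norm(Xa, 2) ** 2)
    for _ in range(n_iter):
        margin = y * (X @ w + w0)
        g = -y / (1.0 + np.exp(margin)) / N    # derivative of mean loss w.r.t. f(x_i)
        w = soft_threshold(w - step * (X.T @ g), step * lam)
        w0 = w0 - step * g.sum()               # the bias is not penalized
    return w, w0

# Synthetic data: feature 0 determines the label; feature 1 is mirrored so that
# it carries no information about the label.
rng = np.random.default_rng(0)
x0 = np.linspace(-2.0, 2.0, 40)
x1 = rng.normal(size=40)
X = np.vstack([np.column_stack([x0, x1]), np.column_stack([x0, -x1])])
y = np.concatenate([np.sign(x0), np.sign(x0)])
w, w0 = fit_sparse_linear(X, y, lam=0.1)
```

In this example the weight on the informative feature is positive while the weight on the uninformative feature is driven exactly to zero. The magnitudes of the surviving weights could then serve as an initial feature weighting of the kind described above.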


